Abstract. For both the Lempel-Ziv 77 and 78 factorizations, we propose algorithms that generate the
respective factorization using (1 + ε)n lg n + O(n) bits of working space (for any positive constant
ε ≤ 1, including the space for the output) for any text of size n over an integer alphabet in O(n/ε²) time.


1 Introduction 

It is difficult to find a practical scenario in computer science in which compression plays no role.
Although the common focus lies on compressing data on disk storage, squeezing transient memory is
also beneficial in practice for some applications. For instance, the zram module of modern Linux
kernels [35] compresses blocks of main memory in order to prevent the system from running out of
working memory. Compressing RAM is often preferable to storing transient data on secondary storage
(e.g., in a swap file), as the latter incurs a more severe performance loss. Another example is
websites, which are usually transferred as “gzipped” data by hosting servers [33]. A server may cache
generated webpages in compressed form in RAM for performance benefits. To sum up, the common task
in these scenarios is to compress and maintain data in main memory so as to provide
space-economical, fast access.

Central to many compression algorithms are the LZ77 [?] and LZ78 [38] factorizations. Both
techniques were invented in the late 1970s and set a milestone in the field of data compression.
Since the main memory sizes of ordinary computers do not grow as fast as datasets do,
insufficient memory is a well-known problem; both huge mainframes with massive datasets and tiny
embedded systems are valid examples in which a simple compressor may end up depleting all RAM.
Besides compression, these factorizations have also proven to be a valuable tool for detecting
various kinds of regularities in strings [1, 6, ?, 12, ?, 23, 26], for indexing [7, ?, 11, 18, 19, 31],
and for analyzing strings [5, ?, ?].

Large datasets pose a challenge to the main memory budget. As a solution, one either has
to think about algorithm engineering in external memory, or about how to slim down memory
consumption during computation in RAM. Regarding the latter, we propose an approach that uses
(1 + ε)n lg n + O(n) bits of working space (for any positive constant ε ≤ 1, including the space for
the output) while sustaining linear-time computation. Our approach differs from the more recent
algorithms (see below) in that it uses a succinct suffix tree representation.

Related Work. While there are naive algorithms that take O(1) working space with quadratic
running time (for both LZ77 and LZ78), linear-time algorithms with very restricted space emerged
only in recent years.

Regarding LZ77, the bound of 3n lg n bits set by [12] was very soon lowered to 2n lg n bits by [?]. For
a small alphabet size σ, the upper bound of n lg n + O(σ lg n) bits by [13] is also very compelling. Their
common idea is the use of previous- and/or next-smaller-value queries [33]. While the approach
of Kärkkäinen et al. [?] stores SA and NSV completely in two arrays, Goto et al. [?] can cope
with a single array whose length depends on the alphabet size. In [20], a practical variant with
worst-case guarantees of (1 + ε)n lg n + n + O(σ lg n) bits of working space and
O(n lg σ/ε²) time was proposed.

Regarding LZ78, by using a naive trie implementation, the factorization is computable with O(z lg z)
bits of space and O(n lg σ) overall running time, where z is the size of the LZ78 factorization. More
sophisticated trie implementations [9] improve this to O(n + z lg² lg σ / lg lg lg σ) time using the
same space.

Jansson et al. [?] proposed a compressed dynamic trie based on word packing, and showed an
application to LZ78 trie construction that runs in O(n(lg σ + lg lg_σ n)/lg_σ n) bits of working space
and O(n lg² lg n/(lg_σ n lg lg lg n)) time. When lg σ = o(lg n lg lg lg n/lg² lg n), their algorithm runs
even in sub-linear time, but in the worst case it is super-linear. For an integer alphabet, a
linear-time algorithm was recently proposed in [30], which utilizes the fact that the LZ78 trie is
superimposed on the suffix tree of a string. Although their algorithm works in O(n lg n) bits of
space, they did not care about the constant factor, and the use of the (complicated) dynamic marked
ancestor queries [1] seems to prevent them from achieving a small constant factor.


2 Preliminaries 

Let Σ denote an integer alphabet of size σ = |Σ| = n^O(1). An element w in Σ* is called a string,
and |w| denotes its length. The empty string of length 0 is called ε. For any 1 ≤ i ≤ |w|, w[i] denotes
the i-th character of w. When w is represented by the concatenation of x, y, z ∈ Σ*, i.e., w = xyz,
then x, y and z are called a prefix, substring and suffix of w, respectively. In particular, a suffix
starting at position i of w is called the i-th suffix of w. For any 1 ≤ j ≤ |w|, let S_j(w) denote the
set of substrings of w that start strictly before j.

In the rest of this paper, we take a string T of length n > 0, which is subject to LZ77 or LZ78
factorization. For convenience, let T[n] be a special character that appears nowhere else in T, so
that no suffix of T is a prefix of another suffix of T. Our computational model is the word RAM
model with word size Ω(lg n). Further, we assume that T is read-only; accessing a word costs O(1)
time (e.g., T is stored in RAM using n lg σ bits).

The suffix trie of T is the trie of all suffixes of T. The suffix tree of T, denoted by ST,
is the tree obtained by compacting the suffix trie of T. ST has n leaves and at most n internal
nodes. We denote by V the nodes and by E the edges of ST. For any edge e ∈ E, the string stored
in e is denoted by c(e) and called the label of e. Further, the string depth of a node v ∈ V is
defined as the length of the concatenation of all edge labels on the path from the root to v. The
leaf corresponding to the i-th suffix is labeled with i. SA and ISA denote the suffix array and the
inverse suffix array of T, respectively [?]. For any 1 ≤ i ≤ n, SA[i] is identical to the label of the
lexicographically i-th leaf in ST. LCP and RMQ are abbreviations for longest common prefix and
range minimum query, respectively. LCP is a DS (data structure) on SA such that LCP[i] is the length
of the LCP of the lexicographically i-th smallest suffix with its lexicographic predecessor for
i = 2, ..., n.
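To make these definitions concrete, the following Python sketch computes SA, ISA and LCP for a short string by plain sorting. It is only an illustration under simplifying assumptions: the function name `suffix_arrays` is hypothetical, the construction is naive (far from the linear-time, small-space constructions cited in this paper), and index 0 of each list is left unused so that the indices match the 1-based notation of the text.

```python
def suffix_arrays(text):
    """Return 1-based SA, ISA and LCP as lists whose index 0 is unused."""
    n = len(text)
    # Naive construction: sort the suffix start positions by their suffixes.
    sa = [0] + sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    isa = [0] * (n + 1)
    for rank in range(1, n + 1):
        isa[sa[rank]] = rank
    # LCP[rank] = length of the longest common prefix of the suffix of
    # that rank and its lexicographic predecessor (defined for rank >= 2).
    lcp = [0] * (n + 1)
    for rank in range(2, n + 1):
        a, b = text[sa[rank - 1] - 1:], text[sa[rank] - 1:]
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        lcp[rank] = length
    return sa, isa, lcp


sa, isa, lcp = suffix_arrays("aaabaabaaabaa$")
print(sa[1:])   # suffix array of the running example
print(lcp[2:])  # LCP values for ranks 2..n
```

The sketch relies on `$` being smaller than the letters in ASCII order, which matches the usual convention for the terminating character.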

For any bit vector B with length |B|, B.rank_1(i) counts the number of '1'-bits in B[1..i], and
B.select_1(i) gives the position of the i-th '1' in B. Given B, a DS that uses additional o(|B|) bits of
space and supports any rank/select query on B in constant time can be built in O(|B|) time [29].
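The semantics of rank and select can be illustrated with a small Python sketch. The class `RankSelect` and its method names are hypothetical and chosen for illustration only; unlike the succinct structures of [29], this version stores full prefix counts and therefore uses far more than o(|B|) extra bits. Positions are 1-based as in the text.

```python
class RankSelect:
    """Plain (non-succinct) rank/select over a bit vector, 1-based positions."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of 1-bits in B[1..i]
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        # ones[k - 1] = position of the k-th 1-bit
        self.ones = [i + 1 for i, b in enumerate(self.bits) if b]

    def rank1(self, i):
        return self.prefix[i]

    def rank0(self, i):
        return i - self.prefix[i]

    def select1(self, k):
        return self.ones[k - 1]


B = RankSelect([0, 1, 0, 0, 1, 0, 1, 1])
print(B.rank1(5), B.rank0(5), B.select1(3))
```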

As a running example, we take the string T = aaabaabaaabaa$. Since both algorithms for LZ77
and LZ78 are based on the suffix tree, we depict the suffix tree of this example string in Fig. 1.

Fig. 1. The suffix tree of T = aaabaabaaabaa$. The leaf labels are displayed by the underlined numbers. The other
numbers show the pre-order of the nodes.


2.1 Lempel Ziv Factorization 

A factorization partitions T into z substrings T = f_1 ⋯ f_z. These substrings are called factors.
In particular, we have:

Definition 1. A factorization f_1 ⋯ f_z = T is called the LZ77 factorization of T iff
f_x = argmax_{S ∈ S_j(T) ∪ Σ} |S| for all 1 ≤ x ≤ z with j = |f_1 ⋯ f_{x−1}| + 1.

The classic LZ77 factorization adds an additional fresh character to the referencing factors such
that the following definition holds:

Definition 2. A factorization f_1 ⋯ f_z = T is called the classic LZ77 factorization of T iff f_x
is the shortest prefix of f_x ⋯ f_z that occurs exactly once in f_1 ⋯ f_x.

Definition 3. A factorization f_1 ⋯ f_z = T is called the LZ78 factorization of T iff f_x = f'_x · c
with f'_x = argmax_{S ∈ {f_y : y < x} ∪ {ε}} |S| and c ∈ Σ for all 1 ≤ x ≤ z.

We identify factors by text positions, i.e., we call a text position j the factor position of f_x
(1 ≤ x ≤ z) iff factor f_x starts at position j. A factor f_x may refer to either (LZ77) a previous text
position j (called f_x's referred position), or (LZ78) to a previous factor f_y (called f_x's referred
factor; in this case y is also called the referred index of f_x). If there is no suitable reference
found for a given factor f_x with factor position j, then f_x consists of just the single letter T[j]. We
call such a factor a free letter. The other factors are called referencing factors.

Our final data structures allow us to access arbitrary factors (factor position and referred
position (LZ77)/referred index (LZ78)) in constant time.
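As a reference point for the definitions above, the following Python sketch computes both factorizations of a short string by brute force. It is meant only to illustrate Definitions 1 and 3: the function names `lz77_naive` and `lz78_naive` are hypothetical, the LZ77 routine takes quadratic time or worse, and neither routine reflects the space-efficient algorithms developed in this paper. The text is assumed to end with a unique character.

```python
def lz77_naive(text):
    """LZ77 (Definition 1): longest prefix of T[j..] that also starts before j."""
    n, j, factors = len(text), 0, []
    while j < n:
        best = 0
        for i in range(j):            # candidate referred positions (0-based)
            length = 0
            # Overlaps with the factor itself are allowed.
            while j + length < n and text[i + length] == text[j + length]:
                length += 1
            best = max(best, length)
        length = max(best, 1)         # no match: free letter
        factors.append(text[j:j + length])
        j += length
    return factors


def lz78_naive(text):
    """LZ78 (Definition 3): longest previous factor plus one new character."""
    trie = {}                         # (parent factor index, char) -> factor index
    n, j, factors = len(text), 0, []
    while j < n:
        node, start = 0, j            # index 0 stands for the empty factor
        while j < n and (node, text[j]) in trie:
            node = trie[(node, text[j])]
            j += 1
        if j < n:
            trie[(node, text[j])] = len(factors) + 1
            j += 1
        factors.append(text[start:j])
    return factors


print("|".join(lz77_naive("aaabaabaaabaa$")))
print("|".join(lz78_naive("aaabaabaaabaa$")))
```

For the running example, the two calls reproduce the factorizations stated in the captions of Fig. 2 and Fig. 3.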
2.2 Data Structures


Common to both our algorithms is the construction of a succinct ST representation. It consists of
SA with n lg n bits, LCP with 2n + o(n) bits, and a 2|V| + o(|V|)-bit representation of the topology
of ST, for which we choose the DFUDS [3] representation. The latter is denoted by SucST. We
make use of several construction algorithms from the literature:

- SA can be constructed in O(n/ε²) time and (1 + ε)n lg n bits of space, including the space for
SA itself [17].

- Given SA, LCP can be computed in O(n) time with no extra space [36]. Note that LCP can only
answer LCP[i] in constant time if SA[i] is also available. This is an important remark, because we
will discard SA on several occasions in order to free space, and this discarding causes additional
difficulties.

- Given both SA and LCP, a space-economical construction of SucST was discussed in [33, Alg. 1].
The authors showed that the DFUDS representation of ST can be built in O(n) time with
n + o(n) bits of working space.

We identify every node v ∈ V with its pre-order number (enumerated by 1, ..., |V|), which is also
the order in which the opening parentheses occur in the DFUDS representation.

Since our ST is static, we can perform various operations on the tree topology in constant time
(see, e.g., [32, 33]). Among them, we especially use the following operations (for any v ∈ V and
i ∈ ℕ): parent(v) returns the parent of v; and level_anc(v, i) returns the i-th ancestor of v. By
building the min-max tree [32] on the DFUDS of ST in O(n) time (using O(n) bits of space), we
can get SucST supporting these operations in constant time.

Additionally, we are interested in answering str_depth(v) on ST; str_depth(v) returns the string
depth of v ∈ V. As noted in [33], an RMQ data structure on LCP can be built in O(n) time
and n + o(n) bits of working space to support str_depth in constant time. Note that the operation
str_depth becomes unavailable when SA is discarded.

Our algorithms in Sect. 3 and 4 make use of two arrays: A_1 of size n lg n bits, and a small helper
array A_2 of size εn lg n bits. (We chose such generic names since the contents of these arrays will
change several times during the LZ-computation.)

Node-Marking Vectors. In our algorithms, we sometimes deal with subsets V' of V. Pre-order
numbers enumerating only the nodes in V' can naturally be used to map nodes in V' to the range
[1..|V'|]. For this purpose, we use a node-marking vector M_{V'}, which is a bit vector of length
|V|, such that M_{V'}[v] = 1 iff v ∈ V' for any 1 ≤ v ≤ |V|. We write ρ_{V'}(v) := M_{V'}.rank_1(v) for any
node v ∈ V'.

3 LZ77 

The main idea is to perform leaf-to-top traversals accompanied by the marking of visited nodes.
The marked nodes are indicated by a '1' in a bit vector of size |V|. Starting from the situation where
only the root is marked, in the j-th leaf-to-top traversal for any 1 ≤ j ≤ n, we traverse ST from
the leaf labeled with j towards the root, while marking visited nodes until we encounter an already
marked node. Observe that right before the j-th leaf-to-top traversal, each string of S_j(T) can be
obtained by following the path from the root to some marked node. Hence, the LZ77 factorization
can be determined during these leaf-to-top traversals: If j is a factor position of a factor f, the last
accessed node v during the j-th leaf-to-top traversal reveals f's referred position. More precisely,
v is either the root, or a node that was already marked in a former traversal. If v is the root, f is
a free letter. Otherwise, we call v the referred node of f. Then, the factor length is str_depth(v),
and the referred position is the minimum leaf label in the subtree rooted at v (retrieved, e.g., by an
RMQ on SA). Since every visited node will be marked, and a marked node will never be unmarked,
the total number of parent(·)-operations is upper bounded by the number of nodes in ST, i.e., O(n).
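The marking idea can be tried out with the following Python sketch. It deliberately departs from the setting of this paper in two labeled ways: it works on the uncompacted suffix trie instead of ST (so it needs quadratic space and time), and instead of an RMQ on SA it remembers for every node the traversal that marked it, which yields the referred position directly. The function name `lz77_by_marking` and the tuple layout of the result are hypothetical.

```python
def lz77_by_marking(text):
    """LZ77 via leaf-to-top traversals on a naive suffix trie.

    Returns (factor position, length, referred position or None) triples,
    all 1-based. Assumes that text ends with a unique character.
    """
    n = len(text)
    children, parent, depth, leaf = [{}], [-1], [0], []
    for j in range(n):                       # insert the suffix starting at j
        v = 0
        for c in text[j:]:
            if c not in children[v]:
                children.append({})
                parent.append(v)
                depth.append(depth[v] + 1)
                children[v][c] = len(children) - 1
            v = children[v][c]
        leaf.append(v)

    marked_by = [None] * len(children)       # traversal that marked each node
    marked_by[0] = -1                        # only the root is marked initially
    factors, next_factor = [], 0
    for j in range(n):                       # j-th leaf-to-top traversal
        v = leaf[j]
        while marked_by[v] is None:          # climb, marking unvisited nodes
            marked_by[v] = j
            v = parent[v]
        if j == next_factor:                 # j is a factor position
            if v == 0:                       # stopped at the root: free letter
                factors.append((j + 1, 1, None))
                next_factor = j + 1
            else:                            # v is the referred node
                factors.append((j + 1, depth[v], marked_by[v] + 1))
                next_factor = j + depth[v]
    return factors


for factor in lz77_by_marking("aaabaabaaabaa$"):
    print(factor)
```

On the suffix tree, the same traversals touch each node only once in total, which is where the O(n) bound on the number of parent operations comes from; the trie version above does not share this bound.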

3.1 Algorithm 

We start with SA stored in A_1[1..n], and some O(n)-bit DS to provide SucST, RMQs on SA, and
RMQs on LCP. Note that the LZ77 computation via leaf-to-top traversals, as explained above,
accesses ISA n times to fetch suffix leaves that are starting nodes of the traversals, and accesses SA
O(z) times to compute the factor lengths and the referred positions. Then, if we have both SA and
ISA, the LZ77 factorization can be easily done in O(n) time by the leaf-to-top traversals. However,
allowing only (1 + ε)n lg n + O(n) bits for the entire working space, it is no longer possible to store
both SA and ISA completely at the same time.

With Extra Output Space. Let us first consider the easier case where the result of the
factorization can be output outside the working space. We can then use the array+inverse DS of Munro
et al. [28, Sect. 3.1], which allows us to access the inverse array's values in O(1/ε) time by spending
additional εn lg n bits (on top of the array's size). Since ISA is accessed more often than SA, we first
convert SA on A_1 into ISA and then create its array+inverse DS so that accessing ISA and SA can
be done in O(1) and O(1/ε) time, respectively. Although it is not explicitly mentioned in [28], the
DS can be constructed in O(n) time. Then, the leaf-to-top traversals can be smoothly conducted,
leading to O(z/ε + n) = O(n) running time.

Although this is already an improvement over the currently best linear-time algorithm using
2n lg n bits [?], doing so would prevent us from also storing the output of the LZ77 factorization
in the working space. Solving this is exactly what is explained in the remainder of this section.

Outline. It is difficult to find space for writing the referred positions; the former algorithm already
uses (1 + ε)n lg n bits of working space for the array+inverse DS. Overwriting it would corrupt the
DS and cause a problem when accessing SA or ISA. We evade this problem by performing several
rounds of leaf-to-top traversals during which we build an array that registers every visit of a referred
node. (A minor remark is that this approach does not even need RMQs on SA.)

Our algorithm is divided into three rounds of leaf-to-top traversals and a final matching phase, 
all of which will be discussed in detail in the following: 

First Round: Construct a bit vector B_F[1..n] marking all factor positions in T, and a bit vector
B_R[1..z] marking the referencing factors. Determine the set of referred nodes V_R ⊂ V, and mark
them with a node-marking vector M_{V_R}.

Second Round: Construct a bit vector B_D counting (in unary) the number of referred nodes
from V_R visited during each traversal.

Third Round: Construct an array D storing the pre-order numbers of all referred nodes visited
during each traversal (as counted in the second round).

Matching: Convert the pre-order numbers in D to referred positions.

Fig. 2 visualizes the leaf-to-top traversals along with the created data structures B_D and D.



Fig. 2. The LZ77 factorization partitions T = aaabaabaaabaa$ as a|aa|b|aabaa|abaa|$. The shaded nodes are the
referred nodes. Nodes 5, 10 and 14 are referred to by f_2, f_4 and f_5, respectively. During the leaf-to-top traversals: In the
1st traversal, node 5 is marked; in the 2nd traversal, node 10 is marked, and node 5 is referred to by factor f_2 with
factor position 2; in the 3rd traversal, node 14 is marked; in the 5th traversal, node 10 is referred to by factor f_4
with factor position 5; in the 10th traversal, node 14 is referred to by factor f_5 with factor position 10. Therefore,
B_D = 01001011011111011111 and D = [5, 10, 5, 14, 10, 14], where referred entries are depicted by shaded entries.


Details. In the first round, we compute the factor lengths as before by leaf-to-top traversals,
which are used to construct B_F. Since the set of referred nodes can be identified during the
leaf-to-top traversals, M_{V_R} can be easily constructed. We also compute B_R by setting B_R[x] ← 1 for every
referencing factor f_x with 1 ≤ x ≤ z. For the rest of the algorithm, the information of SA is not
needed any longer.

We now aim at generating the array D storing a sequence of pre-order numbers of referred
nodes, which will finally enable us to determine the referred positions of each referencing factor. D
is formally defined as a sequence obtained by outputting the pre-orders of referred nodes whenever
they are marked or referred to during the leaf-to-top traversals. Hence, each referred node appears
in D for the first time when it is marked, and after that it occurs whenever it is the last accessed
node of the j-th traversal, where 1 ≤ j ≤ n coincides with a factor position. To see how D will be
useful for obtaining the referred positions, consider a node v ∈ V_R that was marked during the k-th
traversal. If we stumble upon v during the j-th traversal (for any factor position j > k), we know
that k is the referred position for the factor with factor position j (because v had not been marked
before the k-th traversal).

Alas, D alone does not tell us which referred nodes are found during which traversal. We
want to partition D by the n text positions such that we know the traversal numbers to which the
referred nodes belong. This is done by a bit vector B_D that stores a '1' for each text position j, and
intersperses these '1's with '0's counting the number of referred nodes written to D during the
j-th traversal. The size of the j-th partition (1 ≤ j ≤ n) is determined by the number of referred
nodes accessed during the j-th traversal. Hence, the number of '0's between the (j − 1)-th and the
j-th '1' represents the number of entries in D for the j-th suffix. Formally, B_D is a bit vector
such that D[j_b..j_e] represents the sequence of referred nodes that are written to D during the
j-th leaf-to-top traversal, where, for any 1 ≤ j ≤ n, j_b := B_D.rank_0(B_D.select_1(j − 1)) + 1 and
j_e := B_D.rank_0(B_D.select_1(j)). Note that for each factor position j of a referencing factor f, we
encountered its referred node during the j-th traversal; this node is the last accessed node during
that traversal, and was stored in D[j_e], which we call the referred entry of f. Note that we create
neither a rank_0 nor a select_1 DS on B_D because we will get by with sequential scans over B_D
and D.
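The role of B_D as a unary partition of D can be illustrated by the following Python sketch, which splits D into the per-traversal groups with one sequential scan, as the text suggests. The function name `partition_by_traversal` is hypothetical, and B_D is passed as a string of '0' and '1' characters purely for readability.

```python
def partition_by_traversal(b_d, d):
    """Split D into the groups written during traversals 1..n.

    Every '0' in B_D consumes one entry of D; every '1' closes the group
    of the current traversal. Returns a list whose (j-1)-th element holds
    the entries written during the j-th traversal.
    """
    groups, current, i = [], [], 0
    for bit in b_d:
        if bit == "0":
            current.append(d[i])
            i += 1
        else:
            groups.append(current)
            current = []
    return groups


groups = partition_by_traversal("01001011011111011111", [5, 10, 5, 14, 10, 14])
for j, group in enumerate(groups, start=1):
    if group:
        print(j, group)
```

With the values of Fig. 2, only the traversals 1, 2, 3, 5 and 10 contribute entries, and the last entry of the groups of traversals 2, 5 and 10 is the referred entry of the factor starting there.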

Finally, we show the actual computation of B_D and D. Unfortunately, the computation of D
cannot be done in a single round of leaf-to-top traversals; overwriting A_1 naively with D would result
in the loss of necessary information to access the suffix tree's leaves. This is solved by performing
two more rounds of leaf-to-top traversals, as already outlined above: In the second round, with
the aid of M_{V_R}, B_D is generated by counting the number of referred nodes that are accessed during
each leaf-to-top traversal. Next, according to B_D, we sparsify ISA by discarding values related to
suffixes that will not contribute to the construction of D (i.e., those values i for which there is no
'0' between the (i − 1)-th and the i-th '1' in B_D). We align the resulting sparse ISA to the right of
A_1. Afterwards, we overwrite A_1 with D from left to right in a third round using the sparse ISA.
The fact that this is possible is proved by the following lemma.

Lemma 1. |D| ≤ n.

Proof. First note that the size of D is |V_R| + z_R, where z_R is the number of referencing factors
(number of '1's in B_R). Hence, we need to prove that |V_R| + z_R ≤ n. Let z_R^(1) (resp. z_R^(>1)) denote
the number of referencing factors of length 1 (resp. longer than 1), and let V_R^(1) (resp. V_R^(>1)) denote
the referred nodes whose string depth is 1 (resp. longer than 1). Also, z_F denotes the number of
free letters. Clearly, |V_R| = |V_R^(1)| + |V_R^(>1)|, z_R = z_R^(1) + z_R^(>1), |V_R^(1)| ≤ z_F, and
|V_R^(>1)| ≤ z_R^(>1). Hence

|V_R| + z_R = |V_R^(1)| + |V_R^(>1)| + z_R^(1) + z_R^(>1) ≤ z_F + z_R^(1) + 2 z_R^(>1) ≤ n.

The last inequality follows from the fact that the factors are counted disjointly by z_F, z_R^(1) and
z_R^(>1), the sum over the lengths of all factors is bounded by n, and every factor counted by
z_R^(>1) has length at least 2. □

By Lemma 1, D fits in A_1. Since each suffix having an entry in the sparse ISA has at least one
entry in D, overwriting the remaining ISA values before using them will never happen.

Once we have D on A_1, we start matching referencing factors with their referred positions.
Recall that each referencing factor has one referred entry, and its referred position is obtained by
matching the leftmost occurrence of its referred node in D.

Let us first consider the easy case with |V_R| ≤ ⌊nε⌋ such that all referred positions fit into A_2
(the helper array of size εn lg n bits). By B_D we know the leaf-to-top traversal number (i.e., the
leaf's label) during which we wrote D[i] (for any 1 ≤ i ≤ |D|). For 1 ≤ m ≤ |V_R|, the zero-initialized
A_2[m] will be used to store the smallest suffix number at which we found the m-th referred node
(i.e., the m-th node of V_R identified by pre-order).

Let us consider that we have set A_2[m] = k, i.e., the m-th referred node was discovered for the
first time by the traversal of the suffix leaf labeled with k.


Whenever we read the referred entry D[i] of a factor f with factor position larger than k and
ρ_{V_R}(D[i]) = m, we know by A_2[m] = k that the referred position of f is k. Both the filling of A_2 and
the matching are done in one single, sequential scan over D (stored in A_1) from left to right: While
tracking the suffix leaf's label with a counter 1 ≤ k ≤ n, we look at t := ρ_{V_R}(D[i]) and A_2[t] for each
array position 1 ≤ i ≤ |D|: if A_2[t] = 0, we set A_2[t] ← k. Otherwise, D[i] is a referred entry of the
factor f with factor position k, for which A_2[t] stores its referred position. We set A_1[i] ← A_2[t].
By doing this, we overwrite the referred entry of every referencing factor f in D with the referred
position of f.
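The single-scan matching for the easy case can be sketched in Python as follows. This is a simplified illustration, not the in-place procedure of the paper: the function name `match_referred_positions` is hypothetical, ρ_{V_R} is emulated by ranking the distinct values of D, the helper array A_2 is an ordinary list, and the result is returned as a dictionary from factor position to referred position instead of being written back into A_1.

```python
def match_referred_positions(b_d, d):
    """Map each referencing factor position to its referred position.

    b_d is B_D as a string of '0'/'1'; d is the list D of pre-order numbers.
    """
    # Stand-in for rho_{V_R}: rank of each referred node among all of them.
    rho = {v: m for m, v in enumerate(sorted(set(d)))}
    a2 = [0] * len(rho)        # a2[m] = traversal that first wrote node m
    referred = {}
    k, i = 1, 0                # k tracks the current suffix leaf label
    for bit in b_d:
        if bit == "1":         # traversal k is finished
            k += 1
            continue
        t = rho[d[i]]
        if a2[t] == 0:         # first occurrence: the node was marked now
            a2[t] = k
        else:                  # referred entry of the factor starting at k
            referred[k] = a2[t]
        i += 1
    return referred


print(match_referred_positions("01001011011111011111", [5, 10, 5, 14, 10, 14]))
```

For the data of Fig. 2, the scan reports that the factors starting at positions 2, 5 and 10 refer to positions 1, 2 and 3, respectively.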

If |V_R| > ⌊nε⌋, we run the same scan multiple times, i.e., we partition {1, ..., |V_R|} into ⌈|V_R|/(nε)⌉
equi-distant intervals (pad the size of the last one) of size ⌊nε⌋, and perform ⌈|V_R|/(nε)⌉ scans. In
order to skip the referred entries in D belonging to an already scanned part of V_R, we use a bit
vector that marks exactly those positions. Since each scan takes O(n) time, the whole computation
takes O(|V_R|/ε) = O(z/ε) time.

Now we have the complete information of the factorization: The length of the factors can be
obtained by a select-query on B_F, and A_1 contains the referred positions of all referencing factors.
By a left shift we can restructure A_1 such that A_1[x] tells us the referred position (if it exists,
according to B_R[x]) for each factor 1 ≤ x ≤ z. Hence, looking up a factor can be done in O(1) time.

3.2 Classic LZ77 factorization 

During the leaf-to-top traversals in Sect. 3.1, we have to account for the fact that the length
of each referencing factor has to be enlarged (due to the fresh character). It suffices to mark the
factors in B_F according to the possibly modified lengths (B_F is used to retrieve position and
length of any factor); the new shape of B_F implicitly induces a modification of B_D and D. The
fresh character that ends a referencing factor will never be considered to be a factor beginning.
Finally, the fresh character of each referencing factor can be looked up with B_F and T. Lemma 1
still holds for this variant of the factorization; in fact, since z_R^(1) = 0 and V_R^(1) = ∅, the proof gets
easier.

4 LZ78 

Common implementations use a trie for storing the factors. In the beginning, the trie just consists
of the root. For each newly generated factor we append a leaf to the trie. If the parent of this leaf is
the root, the factor is a free letter; otherwise it references the factor that corresponds to the parent
node. Hence, each node (except the root) represents a factor. We call this trie the LZ78 trie. Recall
that all trie implementations have a (log-)logarithmic dependence on σ for top-down traversals (see
the Introduction); one of our tricks is using level_anc queries starting from the leaves in order to
get rid of this dependence. For this task we need ISA to fetch the correct suffix leaf; hence, we first
overwrite SA by its inverse.

4.1 Algorithm 

Interestingly, the LZ78 trie is superimposed on the suffix trie of T [5, 30]. Thus, the LZ78 trie
structure can be represented by ST, with an additional DS storing the number of LZ78 trie nodes
that lie on each edge of ST. Each trie node v is called explicit iff it is not discarded during the
compactification of the suffix trie towards ST; the other trie nodes are called implicit.


For every edge e of ST we use a counting variable 0 ≤ n_e ≤ |c(e)| that keeps track of how far e
is explored. If n_e = 0, then the factorization has not (yet) explored this edge, whereas n_e = |c(e)|
tells us that we have already reached the ending node v ∈ V of e =: (u, v). We defer the question of
how the n_e- and |c(e)|-values are stored in εn lg n bits to Sect. 4.2, as those technicalities might not
be of interest to the general audience.

Because we want to have a representative node in ST for every LZ78 factor, we introduce the
concept of witnesses: For any 1 ≤ x ≤ z, the witness of f_x is the ST node that is either the explicit
representation of f_x, or, if such an explicit representation does not exist, the ending node in ST of
the edge on which f_x lies.

Our next task is therefore the creation of an array W[1..z] such that W[x] stores the pre-order number
of f_x's witness. With W it will be easy to find the referred index y of any referencing factor f_x.
That is because f_y will either share the witness with f_x, or W[y] is the parent node of W[x]. Storing
W will be done by overwriting the first z positions of the array A_1.

We start by computing W[x] for all 1 ≤ x ≤ z in increasing order. Suppose that we have already
processed x − 1 factors, and now want to determine the witness of f_x with factor position j. ISA[j]
tells us where to find the ST leaf labeled with j. Next, we traverse ST from the root towards this
leaf (navigated by level_anc queries in deterministic constant time per edge) until we find the first
edge e with n_e < |c(e)|, namely, e is the edge on which we would insert a new LZ78 trie leaf. It is
obvious that the ending node of e is f_x's witness, which we store in W[x]. We let the LZ78 trie grow
by incrementing n_e. The length of f_x is easily computed by summing up the |c(·)|-values along the
traversed path, plus n_e's value. Having processed f_x with factor position j ∈ [x..n], ISA's values in
A_1[1..j] are not needed anymore. Thus, it is eligible to overwrite A_1[x] by W[x] for 1 ≤ x ≤ z while
computing f_x. Finally, A_1[1..z] stores W. Meanwhile, we have marked the factor positions in a bit
vector B_F.

For our running example, we conducted the traversals, and marked the witnesses and LZ78 trie
nodes superimposed by ST in Fig. 3.
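The computation of the witnesses can be replayed with the following Python sketch. It makes several simplifying assumptions that are not part of the paper: the suffix tree is obtained by compacting a naively built suffix trie, the root-to-leaf path is collected by following parent pointers instead of level_anc queries, and the counters n_e are kept in a plain dictionary keyed by the lower end node of each edge. The function names `suffix_tree` and `lz78_witnesses` are hypothetical. Children are visited in character order, so the pre-order numbers agree with the node numbers used in the running example.

```python
def suffix_tree(text):
    """Return (parent, edge_len, leaf) of the suffix tree, nodes in pre-order.

    parent[v] and edge_len[v] describe the edge entering node v; leaf[j] is
    the node of the j-th suffix. Assumes a unique final character.
    """
    kids, trie_leaf = [{}], {}
    for j in range(len(text)):               # naive suffix trie
        v = 0
        for c in text[j:]:
            if c not in kids[v]:
                kids.append({})
                kids[v][c] = len(kids) - 1
            v = kids[v][c]
        trie_leaf[v] = j + 1
    parent, edge_len, leaf, next_id = {}, {}, {}, 0
    stack = [(0, None, 0)]                   # (trie node, ST parent, distance)
    while stack:
        t, p, dist = stack.pop()
        if t == 0 or len(kids[t]) != 1:      # root, branching node or leaf
            next_id += 1
            parent[next_id], edge_len[next_id] = p, dist
            if t in trie_leaf:
                leaf[trie_leaf[t]] = next_id
            p, dist = next_id, 0
        for c in sorted(kids[t], reverse=True):
            stack.append((kids[t][c], p, dist + 1))
    return parent, edge_len, leaf


def lz78_witnesses(text):
    """Return the witness array W and the factor lengths of LZ78."""
    parent, edge_len, leaf = suffix_tree(text)
    explored = dict.fromkeys(parent, 0)      # n_e, keyed by the edge's end node
    witnesses, lengths, j = [], [], 1
    while j <= len(text):
        path, v = [], leaf[j]
        while parent[v] is not None:         # collect the root-to-leaf path
            path.append(v)
            v = parent[v]
        depth = 0
        for v in reversed(path):             # first edge not fully explored
            if explored[v] < edge_len[v]:
                explored[v] += 1
                witnesses.append(v)
                lengths.append(depth + explored[v])
                break
            depth += edge_len[v]
        j += lengths[-1]
    return witnesses, lengths


print(lz78_witnesses("aaabaabaaabaa$"))
```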

Matching the factors with their references can now be done in a top-down manner by using W.
Let us consider a referencing factor f_x with referred factor f_y. We have two cases: Whenever f_y is
explicitly represented by a node v (i.e., by f_y's witness), v is the parent of f_x's witness. Otherwise,
f_y has an implicit representation and hence has the same witness as f_x. Hence, if W stores at
position x the first occurrence of W[x] in W, f_y is determined by the largest position y < x for
which W[y] = parent(W[x]); otherwise (W[x] is not the first occurrence of W[x] in W), the referred
factor of f_x is determined by the largest y < x with W[x] = W[y].

Now we hold W in A_1[1..z], leaving us A_1[z + 1..n] as free working space that will be used to
store a new array R, storing for each witness w the index of the most recently processed factor
whose witness is w. However, reserving space in R for every witness would be too much (there are
potentially z many of them); we will therefore have to restrict ourselves to a carefully chosen subset
of witnesses. This is explained next.

First, let us consider a witness w that is witnessed by a single factor f_x whose LZ78 trie node
is a leaf. Because no other factor will refer to f_x, we do not have to involve w in the matching.
Therefore, we can neglect all such witnesses during the matching. The other witnesses (i.e., those
being witnessed by at least one factor that is not an LZ78 trie leaf) are collected in a set V_S and
marked by a bit vector M_{V_S}. |V_S| is at most the number z_i of internal nodes of the LZ78 trie, which
is bounded by n − z, due to the following lemma.

Lemma 2. z + z_i ≤ n.

Fig. 3. The LZ78 trie for T = aaabaabaaabaa$ is depicted by bullets on the suffix tree. The LZ78 factorization
partitions T as a|aa|b|aab|aaa|ba|a$. W = [3, 5, 18, 10, 7, 18, 4] and V_S = {3, 5, 18}.


Proof. Let α (resp. β) be the number of free letters that are internal LZ78 trie nodes (resp. LZ78
trie leaves). Also, let γ (resp. δ) be the number of referencing factors that are internal LZ78
trie nodes (resp. LZ78 trie leaves). Obviously, α + β + γ + δ = z. Regarding the factor length, each
referencing factor has a length of at least 2, while each free letter is exactly one character long. Hence
2(γ + δ) + α + β = z + γ + δ ≤ n. Since each LZ78 internal node of depth one (counted by α) has a
distinct LZ78 leaf counted by δ as descendant, α ≤ δ holds. Hence, z + z_i = z + α + γ ≤ z + γ + δ ≤ n. □

By Lemma 2, if we let R store only the indices of factors whose witnesses are in V_S, it fits into
A_1[z + 1..n], and we can use M_{V_S} to address R.

We now describe how to convert W (stored in A_1[1..z]) into the referred indices, such that in
the end A_1[x] contains the referred index of f_x for 1 ≤ x ≤ z. We scan W = A_1[1..z] from left to
right while keeping track of the index of the most recently visited factor that witnesses v, for each
witness v ∈ V_S, in R[ρ_{V_S}(v)]. Suppose that we are now processing f_x with witness v = W[x].

- If v ∉ V_S or R[ρ_{V_S}(v)] is empty, we are currently processing the first factor that witnesses v.
Further, if f_x is not a free letter, its referred factor is explicitly represented by the parent of v.
We can find its referred index at position ρ_{V_S}(parent(v)) in R.

- Otherwise, v ∈ V_S, and R[ρ_{V_S}(v)] has already stored a factor index. Then R[ρ_{V_S}(v)] is the
referred index of f_x.

In either case, if v ∈ V_S, we update R by writing the current factor index x to R[ρ_{V_S}(v)]. Note that
after processing f_x, the value A_1[x] is not used anymore. Hence we can write the referred index of
f_x to A_1[x] (if it is a referencing factor) or set A_1[x] ← 0 (if it is a free letter). In the end, A_1[1..z]
stores the referred indices of every referencing factor.
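The left-to-right conversion of W into referred indices can be sketched in Python as follows. The sketch is an illustration under stated assumptions: R is a dictionary keyed by witness rather than a segment of A_1 addressed via ρ_{V_S}, a factor is recognized as a free letter by being the first factor on an edge leaving the root, and the parent table `PARENT` was derived by hand from the suffix tree of the running example (only the entries needed here are listed). The names `referred_indices`, `PARENT`, `W` and `V_S` are hypothetical.

```python
def referred_indices(witnesses, parent, selected, root=1):
    """Turn the witness array W into referred indices (0 = free letter)."""
    last = {}                        # plays the role of R
    result = []
    for x, v in enumerate(witnesses, start=1):
        if v in last:                # v in V_S and its entry is already filled
            result.append(last[v])
        elif parent[v] == root:      # first factor on an edge leaving the root
            result.append(0)         # free letter
        else:                        # referred factor is explicit at parent(v)
            result.append(last[parent[v]])
        if v in selected:            # remember the most recent factor for v
            last[v] = x
    return result


# Parent relation of the witnesses in the running example (root is node 1).
PARENT = {3: 1, 4: 3, 5: 3, 7: 5, 10: 5, 18: 1}
W = [3, 5, 18, 10, 7, 18, 4]
V_S = {3, 5, 18}
print(referred_indices(W, PARENT, V_S))
```

For the running example, the result says, for instance, that f_4 = aab and f_5 = aaa both refer to f_2 = aa, and that f_6 = ba refers to f_3 = b.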

Now we have the complete information about the LZ78 factorization: For any 1 < x < z, is 
formed by fyC, where y = Ai[x\ is the referred index and c = T)!?/. selecti(x +1) — 1] the additional 
letter (free letters will refer to /o := e). Hence, looking up a factor can be done in 0(1) time. 

4.2 Bookkeeping the LZ78 Trie Representation 

Basically, we store both $n_e$ and $|c(e)|$ for each edge $e$ so as to represent the LZ78 trie construction 
in each step. A naive approach would spend $2\lg(\max_{e \in E} |c(e)|)$ bits for every edge, i.e., $4n \lg n$ bits 
in the worst case. In order to reduce the space consumption to $\varepsilon n \lg n + o(n)$ bits, we will exploit 
two facts: (1) the superimposition of the LZ78 trie on ST takes place only in the upper part of ST, 
and (2) most of the needed $|c(e)|$- and $n_e$-values are actually small. 

More precisely, we will introduce an upper bound for the $n_e$ values, which shows that the 
necessary memory usage for managing the $n_e$ and $|c(e)|$ values is, without a priori knowledge of the 
LZ78 trie's shape, actually very low. 

Note that although we do not know the LZ78 trie's shape, we will reason about those nodes 
that might be created by the factorization. For a node $v \in V$, let $\mathrm{height}(v)$ denote the height of $v$ in 
the LZ78 trie if $v$ is the explicit representation of an LZ78 trie node; otherwise we set $\mathrm{height}(v) = 0$. 

For any node $v \in V$, let $l(v)$ denote the number of descendant leaves of $v$. The following lemma 
gives us a clue on how to find an appropriate upper bound: 

Lemma 3. Let $u, v \in V$ with $e := (u,v) \in E$. Further assume that $v$ is the explicit representation 
of an LZ78 trie node. Then $\mathrm{height}(v)$ is upper bounded by $l(v) - |c(e)|$. 

Proof. Let $\pi$ be a longest path from $u$ to some descendant leaf of $v$, and $d := \mathrm{height}(v) + |c(e)|$ 
(i.e., the number of LZ78 trie edges along $\pi$). By construction of the LZ78 trie, the ST node $v$ must 
have at least $d$ leaves, for otherwise the (explicit or implicit) LZ78 trie nodes on $\pi$ will never get 
explored by the factorization. So $d \le l(v)$, and the statement holds. □ 

Further, let root denote the root node of the suffix trie. In particular, root is an explicit LZ78 trie 
node. Consider two arbitrary nodes $u, v \in V$ with $e := (u,v) \in E$. Obviously, the suffix trie node of 
$v$ is deeper than the suffix trie node of $u$ by $|c(e)|$. Putting this observation together with Lemma 3, 
we define $h : V \to \mathbb{N}_0$, which upper bounds $\mathrm{height}(\cdot)$: 

$$h(v) = \begin{cases} n & \text{if } v = \mathrm{root}, \\ \max\left(0, \min(h(u), l(v)) - |c(e)|\right) & \text{if there is an } e := (u,v) \in E. \end{cases}$$


Since the number of LZ78 trie nodes on an edge below any $v \in V$ is a lower bound for $\mathrm{height}(v)$, 
we conclude with the following lemma: 

Lemma 4. For any edge $e = (v,w) \in E$, $n_e \le \min(|c(e)|, h(v))$. 
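
To make the definition of $h$ concrete, the following sketch evaluates it top-down on the suffix tree of the example text `abab$` and checks the inequality of Lemma 4 against hand-computed $n_e$ values. The tree encoding (dictionaries keyed by node names), the function names, and the example text are assumptions made for illustration; the LZ78 factors of `abab$` are `a`, `b`, `ab`, `$`.

```python
def leaf_counts(children, root):
    """Return l(v), the number of descendant leaves, for every node v."""
    leaves = {}

    def visit(v):
        kids = children.get(v, [])
        leaves[v] = 1 if not kids else sum(visit(w) for w in kids)
        return leaves[v]

    visit(root)
    return leaves


def height_bounds(children, label_len, leaves, root, n):
    """Evaluate h(root) = n and h(v) = max(0, min(h(u), l(v)) - |c(e)|) for e = (u, v)."""
    h = {root: n}
    todo = [root]
    while todo:
        u = todo.pop()
        for v in children.get(u, []):
            h[v] = max(0, min(h[u], leaves[v]) - label_len[(u, v)])
            todo.append(v)
    return h


# Suffix tree of "abab$": nodes are named by their path labels.
children = {"root": ["ab", "b", "$"], "ab": ["abab$", "ab$"], "b": ["bab$", "b$"]}
label_len = {("root", "ab"): 2, ("root", "b"): 1, ("root", "$"): 1,
             ("ab", "abab$"): 3, ("ab", "ab$"): 1,
             ("b", "bab$"): 3, ("b", "b$"): 1}
# Number of LZ78 trie nodes on each edge (edges not listed carry none).
n_e = {("root", "ab"): 2, ("root", "b"): 1, ("root", "$"): 1}

leaves = leaf_counts(children, "root")
h = height_bounds(children, label_len, leaves, "root", n=5)
lemma4_holds = all(n_e.get(e, 0) <= min(c, h[e[0]]) for e, c in label_len.items())
```

In this example the bound is tight on the edges leaving the root and trivially satisfied elsewhere, since no LZ78 trie node lies below the nodes `ab` and `b`.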

Let us remark that Lemma 4 does not yield a tight bound. For example, the height of the LZ78 
trie is indeed bounded by $\sqrt{2n}$ (see, e.g., [21 Lemma 1]). But we do not use this property to keep 
the analysis simple. 

Instead, we classify the edges $e \in E$ into two sets, depending on whether $n_e \le \Delta := n^{\varepsilon/4}$ 
holds for sure or not. By Lemma 4, this classification separates $E$ into $E_{\le\Delta} := \{(u,v) \in E : 
\min(|c((u,v))|, h(u)) \le \Delta\}$ and $E_{>\Delta} := E \setminus E_{\le\Delta}$. Since $2 \lg \Delta$ bits are enough for bookkeeping 
any edge $e \in E_{\le\Delta}$, the space needed for these edges fits in $2 |E_{\le\Delta}| \lg \Delta \le n \varepsilon \lg n$ bits. Thus, our 
focus lies now on the edges in $E_{>\Delta}$; each of them costs us $2 \lg n$ bits. Fortunately, we will show that 
$|E_{>\Delta}|$ is so small that the space of $2 |E_{>\Delta}| \lg n$ bits needed by these edges is in fact $o(n)$ bits. 
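
The space bound for the short edges can be spelled out in one line. The derivation below is a sketch under two assumptions: the threshold is $\Delta = n^{\varepsilon/4}$ (the value consistent with the final $O(n \lg n / n^{\varepsilon/4})$ estimate of this section), and the suffix tree has at most $2n$ edges.

```latex
2\,|E_{\le\Delta}| \lg \Delta
  \;\le\; 2 \cdot 2n \cdot \lg n^{\varepsilon/4}
  \;=\; 4n \cdot \frac{\varepsilon}{4} \lg n
  \;=\; \varepsilon n \lg n .
```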

We call any $e \in E_{>\Delta}$ a $\Delta$-edge and its ending node a $\Delta$-node. The set of all $\Delta$-nodes is 
denoted by $V_\Delta$. As a first task, let us estimate the number of $\Delta$-edges on a path from a node 
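$v \in V_\Delta$ to any of its descendant leaves; because $v$ is a $\Delta$-node with $\mathrm{height}(v) \le h(v)$, this number 
is upper bounded by 

$$\left\lfloor \frac{h(v)}{\Delta} \right\rfloor \le \left\lfloor \frac{l(v) - \Delta}{\Delta} \right\rfloor = \left\lfloor \frac{l(v)}{\Delta} \right\rfloor - 1.$$

For the purpose of analysis, we introduce $\hat{h} : (V_\Delta \cup \{\mathrm{root}\}) \to \mathbb{N}_0$, which upper bounds the number 
of $\Delta$-edges that occur on a path from a node to any of its descendant leaves: 

$$\hat{h}(v) = \begin{cases} \lfloor n/\Delta \rfloor & \text{if } v = \mathrm{root}, \\ \min\left(\hat{h}(p(v)), \lfloor l(v)/\Delta \rfloor\right) - 1 & \text{otherwise,} \end{cases}$$

where $p : V_\Delta \to (V_\Delta \cup \{\mathrm{root}\})$ returns for a node $v$ either its deepest ancestor that is a $\Delta$-node, or 
the root if such an ancestor does not exist. Note that $\hat{h}$ is non-negative by the definition of $V_\Delta$. 
For the actual analysis, $\alpha(v)$ shall count the number of $\Delta$-edges in the subtree rooted at 
$v \in V_\Delta \cup \{\mathrm{root}\}$. 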

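Lemma 5. For any node $v \in (V_\Delta \cup \{\mathrm{root}\})$, $\alpha(v) \le \frac{l(v)}{\Delta} \sum_{i=1}^{\hat{h}(v)} \frac{1}{i}$. 

Proof. We proceed by induction over the values of $\hat{h}(v)$ for every $v \in V_\Delta$. For $\hat{h}(v) = 0$ the subtree 
rooted at $v$ has no $\Delta$-edges; hence $\alpha(v) = 0$. If $\hat{h}(v) = 1$, any $\Delta$-node $w$ of the subtree rooted at $v$ 
holds the property $\hat{h}(w) = 0$. Hence, none of those $\Delta$-nodes are in ancestor-descendant relationship 
to each other. By the definition of $\Delta$-nodes, for any $\Delta$-node $u$, we have $0 \le \lfloor l(u)/\Delta \rfloor - 1$, and hence, 
$\Delta \le l(u)$. By $\Delta \alpha(v) \le \sum_{u \in V_\Delta,\, p(u) = v} l(u) \le l(v)$ we get $\alpha(v) \le l(v)/\Delta$. 

For the induction step, let us assume that the induction hypothesis holds for every $u \in V_\Delta$ with 
$\hat{h}(u) < k$. Let us take a $v \in V_\Delta$ with $\hat{h}(v) = k$. Further, let $V_{k'} := \{u \in V_\Delta : p(u) = v \text{ and } \hat{h}(u) = k'\}$ 
for $0 \le k' \le k-1$ denote the set of $\Delta$-nodes that have the same $\hat{h}$ value and are descendants of 
$v$, without having a $\Delta$-node as ancestor that is a descendant of $v$. These constraints ensure that 
there does not exist any $u \in \bigcup_{0 \le k' \le k-1} V_{k'}$ that is an ancestor or descendant of some other node of 
this union. Thus the sets of descendant leaves of the nodes of the union are disjoint. So it is eligible to denote 
by $L_{k'} := \sum_{u \in V_{k'}} l(u)$ the number of descendant leaves of all nodes of $V_{k'}$. It is easy to see that 
$\sum_{k'=0}^{k-1} L_{k'} \le l(v)$. Now, by the hypothesis, and the fact that each $u$ in the union is the highest $\Delta$-node on 
every path from $v$ to any leaf below $u$, we get 

$$\alpha(v) \le |V_0| + \sum_{k'=1}^{k-1} \sum_{u \in V_{k'}} \left( \frac{l(u)}{\Delta} \sum_{i=1}^{k'} \frac{1}{i} + 1 \right) = |V_0| + \sum_{k'=1}^{k-1} \left( \frac{L_{k'}}{\Delta} \sum_{i=1}^{k'} \frac{1}{i} + |V_{k'}| \right).$$

By definition of $V_{k'}$ and $\hat{h}$, we have $\hat{h}(u) = k' \le \lfloor l(u)/\Delta \rfloor - 1$ and hence $(k'+1)\Delta \le l(u)$ for any 
$u \in V_{k'}$. This gives us $(k'+1)\Delta |V_{k'}| = \sum_{u \in V_{k'}} (k'+1)\Delta \le L_{k'}$. To sum up, we get 

$$\alpha(v) \le \frac{L_0}{\Delta} + \sum_{k'=1}^{k-1} \frac{L_{k'}}{\Delta} \sum_{i=1}^{k'+1} \frac{1}{i} = \sum_{k'=0}^{k-1} \frac{L_{k'}}{\Delta} \sum_{i=1}^{k'+1} \frac{1}{i} \le \frac{l(v)}{\Delta} \sum_{i=1}^{k} \frac{1}{i}.$$

□ 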


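By Lemma 5, $|E_{>\Delta}| = \alpha(\mathrm{root}) \le \frac{n}{\Delta} \sum_{i=1}^{\lfloor n/\Delta \rfloor} \frac{1}{i}$. Since $\sum_{1 \le i \le n/\Delta} \frac{1}{i} \le 1 + \ln\frac{n}{\Delta}$, we have 
$\alpha(\mathrm{root}) \le \frac{n}{\Delta} + \frac{n}{\Delta} \ln\frac{n}{\Delta} = O\left(\frac{n}{\Delta} \lg\frac{n}{\Delta}\right) = O\left(n \lg n / n^{\varepsilon/4}\right)$. We conclude that the space needed for $E_{>\Delta}$ is 
$2 |E_{>\Delta}| \lg n = O\left(n \lg^2 n / n^{\varepsilon/4}\right) = o(n)$ bits. 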
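
The estimate can be checked numerically. The sketch below assumes $\Delta = n^{\varepsilon/4}$ and evaluates the harmonic-sum bound on the number of $\Delta$-edges, its closed-form relaxation, and the resulting number of bits at $2 \lg n$ bits per edge. The function name and the sample parameters are assumptions for illustration.

```python
import math


def delta_edge_bound(n, eps):
    """Return (harmonic bound, closed-form bound, bits) for the Delta-edges.

    Assumes Delta = n ** (eps / 4); all values are upper bounds, not exact counts.
    """
    delta = n ** (eps / 4)
    m = int(n // delta)                              # floor(n / Delta)
    harmonic = sum(1.0 / i for i in range(1, m + 1))
    edges = (n / delta) * harmonic                   # (n / Delta) * H_m
    closed = (n / delta) * (1 + math.log(n / delta))
    bits = 2 * edges * math.log2(n)                  # 2 lg n bits per Delta-edge
    return edges, closed, bits


# Hypothetical sample parameters.
edges, closed, bits = delta_edge_bound(n=2 ** 20, eps=0.5)
```

The $o(n)$ statement is asymptotic: because $\lg^2 n / n^{\varepsilon/4}$ tends to zero slowly, the computed number of bits can still exceed $n$ for moderate input sizes.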
Finally, we explain how to implement the data structures for bookkeeping the LZ78 trie 
representation. By an additional node-marking vector $M_{V_\Delta}$ that marks the nodes of $V_\Delta$, we divide the 
edges into $E_{\le\Delta}$ and $E_{>\Delta}$; rank/select on $M_{V_\Delta}$ allows us to easily store, access and increment the 
$n_e$ values for all edges in constant time. $M_{V_\Delta}$ can be computed in $O(n)$ time when we have SA on 
$A_1$: since str_depth allows us to compute every $|c(e)|$ value in constant time, we can traverse ST in 
a DFS manner while computing $h(v)$ for each node $v$, and hence, it is easy to judge whether the 
current edge belongs to $E_{>\Delta}$. In order to store the $h$ values for all ancestors of the current node 
we use a stack. Observe that the $h$ values on the stack are monotonically increasing; hence we can 
implement it using a DS with $O(n)$ bits [8, Sect. 4.2]. 
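
The traversal described above can be sketched as a preorder DFS that keeps the $h$ values of the ancestors of the current node on a stack and emits one marking bit per node. This is a plain-Python illustration, not the succinct implementation: the explicit tree dictionaries, the precomputed leaf counts, the function name `mark_delta_nodes`, and the toy tree (which is not derived from a real text) are all assumptions.

```python
def mark_delta_nodes(children, label_len, leaves_below, root, n, delta):
    """Return (preorder, bits); bits[i] is 1 iff preorder[i] is a Delta-node.

    A node v with incoming edge e = (u, v) is marked if min(|c(e)|, h(u)) > delta.
    """
    order, bits = [root], [0]
    h_stack = [n]                                  # h values of the current root-to-node path
    path = [root]
    iters = [iter(children.get(root, []))]
    while iters:
        v = next(iters[-1], None)
        if v is None:                              # subtree finished: leave the current node
            iters.pop()
            path.pop()
            h_stack.pop()
            continue
        u, h_u = path[-1], h_stack[-1]
        c = label_len[(u, v)]
        order.append(v)
        bits.append(1 if min(c, h_u) > delta else 0)
        h_stack.append(max(0, min(h_u, leaves_below[v]) - c))
        path.append(v)
        iters.append(iter(children.get(v, [])))
    return order, bits


# Toy tree with 8 leaves; edge lengths are chosen by hand for the example.
children = {"root": ["a", "b"],
            "a": ["c", "d0", "d1", "d2", "d3"],
            "c": ["e0", "e1", "e2"]}
label_len = {("root", "a"): 3, ("root", "b"): 1, ("a", "c"): 3}
label_len.update({("a", d): 1 for d in ["d0", "d1", "d2", "d3"]})
label_len.update({("c", e): 1 for e in ["e0", "e1", "e2"]})
leaves_below = {"root": 8, "a": 7, "c": 3, "b": 1}
leaves_below.update({x: 1 for x in ["d0", "d1", "d2", "d3", "e0", "e1", "e2"]})

order, bits = mark_delta_nodes(children, label_len, leaves_below, "root", n=8, delta=2)
marked = [v for v, b in zip(order, bits) if b]
```

Here the stack is an ordinary Python list; the point of the remark above is that its monotone content allows a more compact representation, which this sketch does not attempt.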